Load Data:

library(readr)

Untreated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_untreated.rds"))
Treated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_treated.rds"))

Filter Vorinostat data:


Functional Analysis of Biomarkers


In the specific analysis, we have already filtered out the biomarkers for treatment with Vorinostat and carried out first analyses with them. For example, we have investigated whether the biomarkers also include the targets of vorinostat indicated in the drug annotation. However, this was not the case. So it is of interest to investigate which functions our biomarkers have and in which pathways they are involved.

First, the functions of the biomarkers will be selected by hand, stored in a table and then examined. Next, we will present and apply some of the numerous enrichment methods. For this purpose we will use the R package “cluster profiler”. Here we will perform and compare enrichment analyses on “Gene Ontology”, “Reactome Pathway” and “KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway”. Finally, we will briefly discuss Gene Set Variation Analysis.

Table of content

1. Functional Analysis “by hand”

2. Functional Analysis using “cluster profiler”

3. Gene Set Variation Analysis

4. Conclusion/Discussion

5. Literatur


1. Functional Analysis “by hand”

We used human gene-databases such as “GeneCards” to do research about our biomarkers. We collected information about the gene-function as well as about the tissue in which the respective genes are mainly expressed. Unfortunately, we were not able to assign a function and tissue to each of the biomarkers since some genes were not sufficiently researched yet. That lead to quite a lot NA-values in our table. On the other hand, we found great numbers of different functions for single genes. We tried to summarize the functions but were not able to consider them all. Consequently, our table is only suitable for the purpose to get a general overview and to investigate, whether we have biomarkers with partly similar features.

1.1. Create a table with biomarker features

We used Excel to create a table with our top 50 biomarkers and their features, a part of the table can be seen below. Then we imported the annotation data in R and definded the tissue- and function-column for better readability:

vorinostat_annotation = read.csv2(paste0(wd,"/data/Biomarkers.csv"),header = TRUE, quote="\"")
tissue = vorinostat_annotation$affected.Tissue..if.specific.
general.function = vorinostat_annotation$general.Function
# rename 
names(vorinostat_annotation)[names(vorinostat_annotation)=="X...Gene"] <- "Gene" 
names(vorinostat_annotation)[names(vorinostat_annotation)=="affected.Tissue..if.specific."] <- "affected Tissue (if specific)" 

knitr::kable(vorinostat_annotation, caption = "vorinostat annotation")
vorinostat annotation
Gene Function affected Tissue (if specific) general.Function
FAM57A amino acid transport and glutathione metabolism lung transport
HIST1H2BD histone cluster NA chromatin binding
PMF1 chromosome segregation NA DNA replication
JADE2 histone acetyltransferase activity NA chromatin binding
LMNB1 lamin protein brain cytoskeleton
AQP3 water channel protein NA ion regulator
KIAA0368 proteasome adaptor and scaffold NA proteinbiosynthesis
ENPP2 ectonucleotide pyrophosphatase/phosphodiesterase germ cells phosphorylation
SYT11 calcium sensor brain neurotransmitter regulation
HIST1H3F histone cluster brain chromatin binding
VWA5A von Willebrand Factor blood coagulation
LOC728392 NA NA NA
TRIP6 hormone receptor NA NA
RUNDC3B NA NA NA
TMEM35B, ZMYM6 transmembrane protein NA membrane
MCM5 initiation of DNA replication skin chromatin binding
ZMYND11 transcriptional repressor brain regulation of transcritption
NFKB1 transcription factor immue system regulation of transcritption
PSMB10 proteasome NA proteinbiosynthesis
HIST2H2BE histone cluster immue system chromatin binding
AVEN apoptosis and caspase activation inhibitor brain cell growth
PHF11 PHD Finger Protein immue system chromatin binding
KIF5C transport processes brain transport
CORO1A cell cycle progression, signal transduction, apoptosis, and gene regulation immue system cell growth
MPDU1 Mannose-P-Dolichol utilization brain metabolism
FOS transcription factor subunit bones regulation of transcritption
DHRS2 dehydrogenase/reductase enzyme NA NA
ABAT GABA reduction brain neurotransmitter regulation
SERPINI1 synaptic plasticity brain neurotransmitter regulation
MIR612///NEAT1 transcription regulator of oncogenes NA regulation of transcritption
HBA2///HBA1 globin for hemoglobin blood oxygen transport
CLU chaperone brain proteinbiosynthesis
STC1 ion transport kidney ion regulator
NMI interacts with oncogenes blood regulation of transcritption
AREG growth factor brain cell growth
CRISPLD2 heparin binding and glycosaminoglycan binding immue system NA
NSMAF binds phospholipids NA NA
HIST1H1C histone cluster NA chromatin binding
SERPINH1 serine protease inhibitor in ER NA NA
HIST1H2BJ///HIST1H2BG histone cluster NA chromatin binding
SLC17A7 ion transport brain ion regulator
TYMS DNA-replication + repair NA DNA replication
ASMTL o-methyltransferase NA NA
TNFSF9 TNF, cytokine blood infammation
FAIM apoptose protection, regulates B-cell signals + differentiation blood cell growth
ADI1 downregulation of cell migration, HepC-Virus replication NA cell migration
TUBB2B Tubulin (microtubuli-component), neuronal migration brain cytoskeleton
HMGA2 transcription regulator, chromosome-condensation NA regulation of transcritption
RGL1 cytoskeleton remodeling NA cytoskeleton
NEU1 neuraminidase NA NA
HIST1H2BD histone cluster NA chromatin binding

1.2. Barplots of biomarker tissue and function

The following barplot shows the prevalence of tissues, in which 50 of our biomarkers are preferentially expressed.

col=palette(rainbow(9))
table(tissue)
## tissue
##        blood       bones         brain       brain    germ cells 
##            5            1           12            1            1 
## immue system       kidney         lung         skin 
##            5            1            1            1
barplot(table(tissue), ylab="counts", main="Affected tissues by biomarkers", col=col, las=3, cex.names = 0.7)

Vorinostat is used against T-cell lymphomas. They origin from T-cells which are a subgroup of white blood cells and of high importance for the immune defense. Thus, we expected Vorinostat to affect preferentially those areas. Some of our biomarkers are indeed mainly active in blood cells and in the immune system but we also found, that quite a lot of the biomarkers show activity in the brain.

col=palette(rainbow(16))
table(general.function) 
## general.function
##               DNA replication                   cell growth 
##                             2                             4 
##                cell migration             chromatin binding 
##                             1                             9 
##                  coagulation                   cytoskeleton 
##                             1                             3 
##                   infammation                 ion regulator 
##                             1                             3 
##                      membrane                   metabolism  
##                             1                             1 
##  neurotransmitter regulation               oxygen transport 
##                             3                             1 
##               phosphorylation          proteinbiosynthesis  
##                             1                             3 
## regulation of transcritption                      transport 
##                             6                             2
barplot(table(general.function), ylab="counts", main="General functions of biomarkers", col=col, las=3, cex.names = 0.7)

In order to get a better overview of the function, we can plot only those functions which occur with a frequency greater or equal to two:

highfrequencyfunctions <- table(general.function)[table(general.function)>2]

col=palette(rainbow(7))
barplot(highfrequencyfunctions, ylab="counts", main="Prevalent functions of biomarkers", col=col, las=3, cex.names = 0.7)

The plot indicates, that Vorinostat affects amongst other things especially genes which are involved in chromatin binding and transcription regulation. We already knew, that the drugs working mechanism is to create an altered chromatin structure and gene accessibility. However, we did not know, that those altered structures have a high impact on genes involved in chromatin-organisation as well.

In an other column in the self created annotation table we specified the function of some biomarkers more precise. Some of the chromatin-binding biomarkers function as histone components.

length(grep(vorinostat_annotation$Function, pattern = "histone cluster"))
## [1] 6

In 50 researched biomarkers we found six histone genes which show a clearly altered expression after Vorinostat treatment. That leads to the assumption, that beside Vorinistats working mechanism as a histone modification regulator it might affect the number or composition of histones as well.


2. Functional Analysis using “cluster profiler”

The R package “cluster profiler”, created by Guangchuang Yu et al. allows the user to create and visualize different functional profiles of genes and gene clusters using different enrichment methods. Here, we will focus on “Gene Ontology”, “Reactome Pathway” and “KEGG Pathway” methodes.

2.1. Gene Ontology Analysis

Gene Ontology is an enrichment Analysis, which hierarchically classifies genes or gene products to terms organized in an “Gene Ontology” system. The genes can be identified and ordered in three different ways (Yon Rhee et al., 2008):

  1. Biological Process: describing a general cellular or physiological role
  2. Molecular Function: describing the molecular activity
  3. Cellular Component: describing the location in the cell where the function of the gene is executed

In addition, a statistical test can be carried out on the assignment, which outputs a p-value. This value then indicates how significant the assignment has been. The smaller the value, the better the gene could be assigned to a group. For example, almost all genes can be assigned to a biological process. This assignment would therefore not be significant. In contrast, there are only a few genes that belong to the group “DNA repair”. The smaller the pValue of the assigned gene, the larger the association with this group. (http://geneontology.org/docs/go-enrichment-analysis/)

First of all, we need to load the libraries, we will work with:

library(clusterProfiler)
library(org.Hs.eg.db)

To be able to work with our biomarker genes in cluster profiler, we need to translate them into another Gene ID:

translated.genes= bitr(generalbiomarkergenes,
                        fromType="SYMBOL", 
                        toType="ENTREZID",
                        OrgDb = org.Hs.eg.db) 
head(translated.genes)
##     SYMBOL ENTREZID
## 1    DHRS2    10202
## 2     ABAT       18
## 3 SERPINI1     5274
## 6      CLU     1191
## 7     STC1     6781
## 8      NMI     9111

As a first step, we will assign the biomarkers to the groups described above. First we define the “input genes” with with the correct Gene ID:

# load the needed library
library(DOSE)
# take the needed gene ID
gene=translated.genes$ENTREZID
head(gene)
## [1] "10202" "18"    "5274"  "1191"  "6781"  "9111"

Now we can group them, using all 3 categories:

## 1. Biological Process
ggo.bp <- groupGO(gene= gene, OrgDb = org.Hs.eg.db,  ont = "BP", level = 3, readable = FALSE)
head(summary(ggo.bp))
##                    ID                              Description Count
## GO:0019953 GO:0019953                      sexual reproduction     4
## GO:0019954 GO:0019954                     asexual reproduction     0
## GO:0022414 GO:0022414                     reproductive process     9
## GO:0032504 GO:0032504      multicellular organism reproduction     4
## GO:0032505 GO:0032505 reproduction of a single-celled organism     0
## GO:0061887 GO:0061887         reproduction of symbiont in host     0
##            GeneRatio                                    geneID
## GO:0019953      4/90                          18/8061/330/8900
## GO:0019954      0/90                                          
## GO:0022414      9/90 18/6781/374/8644/8061/2353/330/10370/8900
## GO:0032504      4/90                        6781/8061/330/8900
## GO:0032505      0/90                                          
## GO:0061887      0/90
## 2. Molecular Function
ggo.mf <- groupGO(gene= gene, OrgDb = org.Hs.eg.db,  ont = "MF", level = 3, readable = FALSE)
head(summary(ggo.mf))
##                    ID
## GO:0001070 GO:0001070
## GO:0001072 GO:0001072
## GO:0001073 GO:0001073
## GO:0001134 GO:0001134
## GO:0003700 GO:0003700
## GO:0003711 GO:0003711
##                                                           Description
## GO:0001070               RNA-binding transcription regulator activity
## GO:0001072 transcription antitermination factor activity, RNA binding
## GO:0001073 transcription antitermination factor activity, DNA binding
## GO:0001134                transcription regulator recruiting activity
## GO:0003700                  DNA-binding transcription factor activity
## GO:0003711                transcription elongation regulator activity
##            Count GeneRatio
## GO:0001070     0      0/90
## GO:0001072     0      0/90
## GO:0001073     0      0/90
## GO:0001134     0      0/90
## GO:0003700    10     10/90
## GO:0003711     0      0/90
##                                                        geneID
## GO:0001070                                                   
## GO:0001072                                                   
## GO:0001073                                                   
## GO:0001134                                                   
## GO:0003700 8091/8061/2353/4790/10370/2969/2305/467/7157/23635
## GO:0003711
## 3. Cellular Component
ggo.cc <- groupGO(gene= gene, OrgDb = org.Hs.eg.db,  ont = "CC", level = 3, readable = FALSE)
head(summary(ggo.cc))
##                    ID                    Description Count GeneRatio
## GO:0005886 GO:0005886                plasma membrane    26     26/90
## GO:0005628 GO:0005628              prospore membrane     0      0/90
## GO:0005789 GO:0005789 endoplasmic reticulum membrane     4      4/90
## GO:0019867 GO:0019867                 outer membrane     1      1/90
## GO:0031090 GO:0031090             organelle membrane    17     17/90
## GO:0034357 GO:0034357        photosynthetic membrane     0      0/90
##                                                                                                                                               geneID
## GO:0005886 6781/57030/8744/55256/4758/79850/360/1490/5168/23208/7205/11151/8061/6515/80328/4900/10486/8829/1368/55915/6253/306/27131/27340/2014/4804
## GO:0005628                                                                                                                                          
## GO:0005789                                                                                                                        374/9526/6253/2281
## GO:0019867                                                                                                                                       637
## GO:0031090                                                  1191/374/57030/7298/4758/4001/23208/11151/51131/6515/4900/64921/57134/306/27131/637/2281
## GO:0034357

The result of the assignment can be visualized via a bar plot.

# visualization 
par(mar=c(5, 4, 5, 9))
par(mfrow=c(1,3))
barplot(ggo.bp, drop=TRUE, showCategory=12)

barplot(ggo.mf, drop=TRUE, showCategory=12)

barplot(ggo.cc, drop=TRUE, showCategory=12)

We see that we have assignments in all Barplots in which almost all belong. In biological process, for example, we have an excessive number of genes involved in the “nitrogen compound metabolic process”, in molecular function we have a large number with “hydrolase activity” and all genes localised in the “cell part”. Consequently, these bar plots are not very informative. For this reason we will now additionally check the statistical parameters by an enrichment analysis to see how significant our allocations are.

Here we also need to translate the Gene IDs first:

gene.df <- bitr(generalbiomarkergenes, fromType = "SYMBOL",
                toType = c("ENSEMBL"),
                OrgDb = org.Hs.eg.db)

For the enrichment analysis we have to set some important statistical parameters. As we expect to have small pValues we set the significance level to 0.1. Because we have multiple compairisons, we also have to use the Adjust p.Value Method. Here we use the Benjamini-Hochberg Method, which controls the false discovery rate (FDR) and guarantees a powerful test. As limit for our q-value we have chosen 0.05, which is used by default.

# zum nachlesen 
#http://www.nonlinear.com/support/progenesis/comet/faq/v2.0/pq-values.aspx
## 1. Biological Process
ego.bp <- enrichGO(gene        = gene.df$ENSEMBL,
                 OrgDb         = org.Hs.eg.db,
                 keyType       = 'ENSEMBL',
                 ont           = "BP",
                 pAdjustMethod = "BH",
                 pvalueCutoff  = 0.01,
                 qvalueCutoff  = 0.05)

head(summary(ego.bp))
##                    ID                         Description GeneRatio
## GO:0006689 GO:0006689       ganglioside catabolic process      8/95
## GO:0009313 GO:0009313   oligosaccharide catabolic process      8/95
## GO:0046479 GO:0046479 glycosphingolipid catabolic process      8/95
## GO:0019377 GO:0019377        glycolipid catabolic process      8/95
## GO:0046514 GO:0046514          ceramide catabolic process      8/95
## GO:0001573 GO:0001573       ganglioside metabolic process      8/95
##             BgRatio       pvalue     p.adjust       qvalue
## GO:0006689 14/20505 4.609982e-16 1.005437e-12 8.778376e-13
## GO:0009313 22/20505 4.762652e-14 3.462448e-11 3.023031e-11
## GO:0046479 22/20505 4.762652e-14 3.462448e-11 3.023031e-11
## GO:0019377 24/20505 1.087160e-13 5.927741e-11 5.175455e-11
## GO:0046514 25/20505 1.592733e-13 6.947503e-11 6.065799e-11
## GO:0001573 30/20505 8.457689e-13 3.074370e-10 2.684204e-10
##                                                                                                                                     geneID
## GO:0006689 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0009313 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0046479 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0019377 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0046514 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0001573 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
##            Count
## GO:0006689     8
## GO:0009313     8
## GO:0046479     8
## GO:0019377     8
## GO:0046514     8
## GO:0001573     8
## 2. Molecular Function
ego.mf <- enrichGO(gene        = gene.df$ENSEMBL,
                 OrgDb         = org.Hs.eg.db,
                 keyType       = 'ENSEMBL',
                 ont           = "MF",
                 pAdjustMethod = "BH",
                 pvalueCutoff  = 0.01,
                 qvalueCutoff  = 0.05)

head(summary(ego.mf))
##                    ID                                          Description
## GO:0052794 GO:0052794                  exo-alpha-(2->3)-sialidase activity
## GO:0052795 GO:0052795                  exo-alpha-(2->6)-sialidase activity
## GO:0052796 GO:0052796                  exo-alpha-(2->8)-sialidase activity
## GO:0004308 GO:0004308                         exo-alpha-sialidase activity
## GO:0016997 GO:0016997                             alpha-sialidase activity
## GO:0004553 GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds
##            GeneRatio   BgRatio       pvalue     p.adjust       qvalue
## GO:0052794      8/97  12/19626 1.291220e-16 1.480599e-14 1.377301e-14
## GO:0052795      8/97  12/19626 1.291220e-16 1.480599e-14 1.377301e-14
## GO:0052796      8/97  12/19626 1.291220e-16 1.480599e-14 1.377301e-14
## GO:0004308      8/97  14/19626 7.770352e-16 5.346002e-14 4.973025e-14
## GO:0016997      8/97  14/19626 7.770352e-16 5.346002e-14 4.973025e-14
## GO:0004553      9/97 105/19626 2.450757e-09 1.405101e-07 1.307071e-07
##                                                                                                                                                     geneID
## GO:0052794                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0052795                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0052796                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0004308                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0016997                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0004553 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000117643
##            Count
## GO:0052794     8
## GO:0052795     8
## GO:0052796     8
## GO:0004308     8
## GO:0016997     8
## GO:0004553     9
##  3. Cellular Component
ego.cc <- enrichGO(gene        = gene.df$ENSEMBL,
                 OrgDb         = org.Hs.eg.db,
                 keyType       = 'ENSEMBL',
                 ont           = "CC",
                 pAdjustMethod = "BH",
                 pvalueCutoff  = 0.01,
                 qvalueCutoff  = 0.05)

head(summary(ego.cc))
##                    ID               Description GeneRatio   BgRatio
## GO:0035580 GO:0035580    specific granule lumen      9/99  83/21794
## GO:0042581 GO:0042581          specific granule     11/99 213/21794
## GO:0034774 GO:0034774   secretory granule lumen     13/99 374/21794
## GO:0060205 GO:0060205 cytoplasmic vesicle lumen     13/99 392/21794
## GO:0031983 GO:0031983             vesicle lumen     13/99 393/21794
## GO:0043202 GO:0043202           lysosomal lumen      8/99 111/21794
##                  pvalue     p.adjust       qvalue
## GO:0035580 1.415185e-10 3.240774e-08 2.785680e-08
## GO:0042581 3.565349e-09 4.082325e-07 3.509054e-07
## GO:0034774 1.486233e-08 1.134491e-06 9.751775e-07
## GO:0060205 2.586890e-08 1.220798e-06 1.049364e-06
## GO:0031983 2.665498e-08 1.220798e-06 1.049364e-06
## GO:0043202 4.087960e-08 1.560238e-06 1.341138e-06
##                                                                                                                                                                                                                     geneID
## GO:0035580                                                                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320
## GO:0042581                                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000059804/ENSG00000138772
## GO:0034774 ENSG00000163536/ENSG00000120885/ENSG00000103196/ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000109107
## GO:0060205 ENSG00000163536/ENSG00000120885/ENSG00000103196/ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000109107
## GO:0031983 ENSG00000163536/ENSG00000120885/ENSG00000103196/ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000109107
## GO:0043202                                                                                 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
##            Count
## GO:0035580     9
## GO:0042581    11
## GO:0034774    13
## GO:0060205    13
## GO:0031983    13
## GO:0043202     8

The results of the enrichment can be visualized in many different ways. We will now present some possible plots based on biological function:

# Dot Plot 
dotplot(ego.bp, showCategory=12)

This plot is similar to a bar plot, except that the bars were replaced by points. The size of the points indicates how many genes there are in this category and the color gives information about the pValue. In contrast to the barplot, less genes per category are displayed here. Since we have defined a p and q value, these are only “significant” categories.We see that categories with fewer genes have a smaller pValue and have therefore been better assigned.

# Gene-Concept Network
cnetplot(ego.bp)

The Gene-Concept Network shows us in which categories a gene occurs. Also here the size of the points is proportional to the number of genes in the corresponding category.

# Enrichment Map for enrichment result of over-representation test or gene set enrichment analysis
emapplot(ego.bp)

The enrichment map connects the two plots seen befor. We see the related pathways and how many genes with which pValues are allocated in one category.

Since we believe that dot plots are the easiest to interpret and show the most important information at first glance, we will now only display the other two GO enrichment results with these.

# Dot Plot Molecular Function
dotplot(ego.mf, showCategory=12)

Here we see the diffrent molecular functions of our biomarkers. Like in the plots above, the result here is diffrent from the result from the Gene Ontology grouping. Nevertheless, these results are not very meaningful as the listed modifications are not assigned to specific molecules.Consequently, we cannot say anything about their effects.

# Dot Plot Cellular Component 
dotplot(ego.cc, showCategory=12)

This plot shows the location of the biomarkers in the different cellular componets. Also here we see other categories than in the Barplot. But even like the polt before it, this does not give us any important information about the influence on cancer.

2.2. Reactome Pathway Analysis

Reactome Pathway contains a big database of biological processes like signal transduction, DNA repair and metabolism. Thus it can be used as analysis tool for discovering functional relationships between genes from expression data (Fabregat et al., 2018). It returns enriched pathways from a given vector of genes, thereby also performing FDR control. The statistical part works similar to the one of Gene Ontology (GO) Analysis.

As a first step, we need to load the Reactome library:

library(ReactomePA)

Here, as well, we have to translate the gene ID of our biomarkers into the ENTREZID form:

gene.df <- bitr(generalbiomarkergenes, fromType = "SYMBOL",
                toType = c("ENTREZID"),
                OrgDb = org.Hs.eg.db)
x <- enrichPathway(gene=gene.df$ENTREZID,
                   pvalueCutoff = 0.05,
                   readable = T )
head(as.data.frame(x))
##                          ID
## R-HSA-2559586 R-HSA-2559586
## R-HSA-2559583 R-HSA-2559583
## R-HSA-2262752 R-HSA-2262752
## R-HSA-2559582 R-HSA-2559582
## R-HSA-2559584 R-HSA-2559584
## R-HSA-8852276 R-HSA-8852276
##                                                                  Description
## R-HSA-2559586                  DNA Damage/Telomere Stress Induced Senescence
## R-HSA-2559583                                            Cellular Senescence
## R-HSA-2262752                                   Cellular responses to stress
## R-HSA-2559582               Senescence-Associated Secretory Phenotype (SASP)
## R-HSA-2559584 Formation of Senescence-Associated Heterochromatin Foci (SAHF)
## R-HSA-8852276      The role of GTSE1 in G2/M progression after G2 checkpoint
##               GeneRatio   BgRatio       pvalue     p.adjust       qvalue
## R-HSA-2559586     10/60  80/10554 1.948831e-11 9.140019e-09 6.482428e-09
## R-HSA-2559583     12/60 195/10554 7.268531e-10 1.704471e-07 1.208872e-07
## R-HSA-2262752     16/60 426/10554 1.096103e-09 1.713575e-07 1.215328e-07
## R-HSA-2559582      8/60 110/10554 1.757728e-07 2.060936e-05 1.461689e-05
## R-HSA-2559584      4/60  16/10554 1.632265e-06 1.531064e-04 1.085886e-04
## R-HSA-8852276      6/60  75/10554 3.883378e-06 3.035507e-04 2.152890e-04
##                                                                                                                            geneID
## R-HSA-2559586                                       HIST1H1C/HMGA2/HIST1H2BD/LMNB1/HIST2H2BE/HIST1H4H/CCNA1/TP53/HIST1H2AE/CDKN1A
## R-HSA-2559583                             HIST1H1C/HMGA2/HIST1H2BD/LMNB1/HIST2H2BE/FOS/NFKB1/HIST1H4H/CCNA1/TP53/HIST1H2AE/CDKN1A
## R-HSA-2262752 HIST1H1C/TUBB2B/HMGA2/HIST1H2BD/LMNB1/HIST2H2BE/FOS/NFKB1/PSMB10/CITED2/HIST1H4H/CCNA1/TP53/TUBB4A/HIST1H2AE/CDKN1A
## R-HSA-2559582                                                       HIST1H2BD/HIST2H2BE/FOS/NFKB1/HIST1H4H/CCNA1/HIST1H2AE/CDKN1A
## R-HSA-2559584                                                                                           HIST1H1C/HMGA2/LMNB1/TP53
## R-HSA-8852276                                                                              TUBB2B/PSMB10/GTSE1/TP53/TUBB4A/CDKN1A
##               Count
## R-HSA-2559586    10
## R-HSA-2559583    12
## R-HSA-2262752    16
## R-HSA-2559582     8
## R-HSA-2559584     4
## R-HSA-8852276     6

For visualization we can use the same type of plots as for GO Analysis.

barplot(x,showCategory=12)

The color of the bars is associated with the pValue. The smaller the value, the better the assignment. Morover, we see categories that are crealy associated with cancer growth like “TP53 Regulates Transcription of Genes Involved in G1 Cell Cycle Arrest”. Now it would be interesting to see if these genes are regulated up or down. This can be visualized in a cnet plot where the FC is stained.

# FC data for coloring 
FC=TreatedVorinostat-UntreatedVorinostat
FC<-FC[generalbiomarkergenes,]
names(FC) <- gene.df$ENTREZID

cnetplot(x,categorySize="geneNum",foldChange = FC)

We see that there are both highly and down-regulated genes in the same pathways. Unfortunately, the plot does not show whether the whole pathway is activated or inactivated.

The whole enrichment analysis can be summarized in the enrichment map. However, due to the numerous connecting lines it is no longer clear and the Fold Change is also missing in this plot.

emapplot(x)

2.3. KEGG Pathway Analysis

The next enrichment method we use is KEGG Pathway Analysis. KEGG stands for Kyoto Encyclopedia of Genes and Genomes. It is a freely accessible database that contains information on various drugs, metabolic pathways and genes, among other things (Kanehisa and Goto, 2000). The enrichment process is the same as for the other two methods presented here.

First, we need to load the libaries we will work with:

library(KEGG.db)
library(org.Hs.eg.db)
library(enrichplot)
library(DOSE)

The gene IDs must also be translated:

gene.df <- bitr(generalbiomarkergenes, fromType = "SYMBOL",
                toType = c("ENTREZID"),
                OrgDb = org.Hs.eg.db)

Now we can perform the enrichment. We did not choose a Adjust p.Value method because we could find more categories this way. When setting a specific pAdjustMethod, we would only be able to identify 5 categories.

kk <- enrichKEGG(gene=gene.df$ENTREZID,
                 pvalueCutoff = 0.05,
                 pAdjustMethod = "none")
head(summary(kk))
##                ID                             Description GeneRatio
## hsa05202 hsa05202 Transcriptional misregulation in cancer      8/49
## hsa04210 hsa04210                               Apoptosis      6/49
## hsa05203 hsa05203                    Viral carcinogenesis      7/49
## hsa05166 hsa05166 Human T-cell leukemia virus 1 infection      7/49
## hsa05161 hsa05161                             Hepatitis B      6/49
## hsa04115 hsa04115                   p53 signaling pathway      4/49
##           BgRatio       pvalue     p.adjust      qvalue
## hsa05202 186/7867 1.652177e-05 1.652177e-05 0.002173917
## hsa04210 136/7867 1.814709e-04 1.814709e-04 0.009703879
## hsa05203 201/7867 2.212484e-04 2.212484e-04 0.009703879
## hsa05166 219/7867 3.737403e-04 3.737403e-04 0.012294090
## hsa05161 163/7867 4.827212e-04 4.827212e-04 0.012703188
## hsa04115  72/7867 1.001411e-03 1.001411e-03 0.017869939
##                                          geneID Count
## hsa05202 8091/4790/330/4291/8900/7157/1026/4804     8
## hsa04210            4001/2353/4790/330/7157/637     6
## hsa05203     3017/8349/4790/8365/8900/7157/1026     7
## hsa05166     8061/2353/4790/8829/8900/7157/1026     7
## hsa05161           2353/4790/8900/7157/1026/637     6
## hsa04115                    51512/7157/1026/637     4

For the visualization of the results, we use the same plots as before, except for an additional heatplot that is based on a heatmap.

par(mfrow=c(2,1))
barplot(kk,showCategory=12)

# FC data for coloring 
FC=TreatedVorinostat-UntreatedVorinostat
FC<-FC[generalbiomarkergenes,]
names(FC) <- gene.df$ENTREZID

cnetplot(kk,categorySize="pvalue",foldChange=FC)

We see that the KEGG database has found pathways that can be directly linked to vorinostat, such as the category “Human T-cell leukemia virus 1 infection”. As already mentioned in the introduction of the board and specific analysis, vorinostat is used in leukaemia diseases. Thus, in comparison KEGG provides the best drug-specific enrichment results. These are summarized in the following heatplot. Here we can see the single genes on the x-axis and on the y-axis the pathways they are involved in marked by small boxes. Since boxes are stained with the Fold Change, we can also read here whether a gene has been down or upregulated.

heatplot(kk, foldChange=FC)

In addition, the KEGG data can be used to create so-called pathway graphs. For this the library “pathview” is needed. A pathway ID from the KEGG Pathway enrichment result is entered into the function and the graphic is created. This will be demonstrated in the following using the first listed pathway from the previous KEGG enrichment analysis.

Load the library:

library(pathview)

Load the pathway graph:

pathview(gene.data = FC, 
         pathway.id = "hsa05202", 
         species = "hsa", 
         limit = list(gene=5, cpd=1))

Here we see the pathway “Transcriptional misregulation in cancer - Homo sapiens (human)”. There are diffent flow charts with the corresponding genes for dirffent cancer types. With this information one could possibly make predictions in individual cancer therapy. If a patient’s gene expression is known, it can be compared with our biomarkers and a prediction can be made of the ways in which vorinostat might work. In the pathway graphs, individual genes and reaction pathways can be specifically observed.


3. Gene Set Variation Analysis

Finally, we performed the Gene Set Variation Analysis, short GSVA. GSVA is an analysis method that exports gene expression profiles into pathway or siganture summary. Consequently, it allows easier interpretation of the data and helps to model the pathway activation (Haenzelmann et al., 2013).

First we perform the GSVA enrichment, resulting in an enrichment score for each gene and each sample:

# create data needed for GSVA Analysis 
generalbiomarkergenes=as.list(generalbiomarkergenes)
FC=TreatedVorinostat-UntreatedVorinostat

library(GSVA)
g<- gsva(FC,
         generalbiomarkergenes,
         mx.diff=TRUE,
         verbose=TRUE)

The enrichment results can be visualized by an heatmap :

heat <- t(scale(t(g)))

myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-1.5, 1.5, length.out=101)



library(gplots)
png(filename = "GSVAheat.png")
GSVAheat= heatmap.2(heat,
          col=myCol,
          breaks=myBreaks,
          main="GSVA Enrichment Score of Biomarker",
          xlab="genes",
          ylab="samples",
          labRow="", 
          labCol="",
          key=TRUE,
          keysize=1.0,
          key.title="",
          key.xlab="Enrichment Z-score",
          scale="none", 
          density.info="none",
          reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
          trace="none",
          cexRow=1.0,
          cexCol=1.0,
          distfun=function(x) dist(x, method="euclidean"),
          hclustfun=function(x) hclust(x, method="ward.D2"))

The enrichement score summarizes the gene expression of each single genes and is given als the maximum derivation of zero. In our heatmap we can see two big gene groups looking on the left side, which also can be found in the dendogram of the heatmap.

Since we have a mapping of a gene and a sample, we see that the gsva enrichment did not work properly, because we were supposed to get groups of genes in a particular pathway. Such a heatmap with pathways would look like the following. It was created by Haenzelmann et al. with leukemia data.

We can see the diffrent samples grouped by pathways. The pictures shows the diffrentially activated pathways in the Leukemia data set.


4. Conclusion/Discussion

Finally, looking at all enrichment methods, it can be said that KEGG is best suited for our purposes. On the one hand it provides cancer- and drug-specific pathway information and on the other hand it can be used to create pathway graphs that could be used for predictions in individual therapy. Gene Ontology Analysis provides only very superficial enrichment results and is more suitable for sorting genes by broad functions. It has been shown how important the inclusion of statistical parameters is. Without these, one often gets classifications that apply to all genes and which are not specific. The Reactome Pathway Analysis had provided us with more specific analysis results, including cancer-specific metabolic pathways. However, in contrast to KEGG, the information about the progression of the pathways and the more precise function of individual genes was missing. Since the GSVA enrichment did not work properly with us, no statement can be made about this method.


5. Literatur

Fabregat, A., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., Haw, R., Jassal, B., Korninger, F., May, B., et al. (2018). The Reactome Pathway Knowledgebase. Nucleic Acids Res 46, D649-d655.

Haenzelmann, S., Castelo, R., and Guinney, J. (2013). GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7.

Kanehisa, M., and Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 28, 27-30.

Yon Rhee, S., Wood, V., Dolinski, K., and Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics 9, 509.